Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays
Authors
Abstract
We discuss a multiple-play multi-armed bandit (MAB) problem in which several arms are selected at each round. Recently, Thompson sampling (TS), a randomized algorithm with a Bayesian spirit, has attracted much attention for its empirically excellent performance, and it has been shown to achieve an optimal regret bound in the standard single-play MAB problem. In this paper, we propose the multiple-play Thompson sampling (MP-TS) algorithm, an extension of TS to the multiple-play MAB problem, and discuss its regret analysis. We prove that MP-TS for binary rewards has an optimal regret upper bound that matches the regret lower bound provided by Anantharam et al. (1987). Therefore, MP-TS is the first computationally efficient algorithm with optimal regret. A set of computer simulations was also conducted to compare MP-TS with state-of-the-art algorithms. We also propose a modification of MP-TS, which is shown to have better empirical performance.
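The MP-TS algorithm for binary rewards described above can be sketched roughly as follows: each arm keeps a Beta posterior, and at every round one sample is drawn per arm and the arms with the largest samples are played. This is a minimal illustrative sketch, not the authors' reference implementation; the function name `mp_ts` and the simulated Bernoulli arms are assumptions for the example.

```python
import random


def mp_ts(arm_means, n_select, n_rounds, seed=0):
    """Illustrative sketch of multiple-play Thompson sampling (MP-TS)
    for binary rewards, run against simulated Bernoulli arms.

    arm_means: true success probability of each simulated arm (for the demo).
    n_select:  number of arms played per round (the "multiple plays").
    """
    rng = random.Random(seed)
    k = len(arm_means)
    alpha = [1] * k  # Beta posterior successes + 1 (uniform Beta(1,1) prior)
    beta = [1] * k   # Beta posterior failures + 1
    total_reward = 0
    for _ in range(n_rounds):
        # Draw one posterior sample per arm.
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Play the n_select arms with the largest samples.
        chosen = sorted(range(k), key=lambda i: theta[i], reverse=True)[:n_select]
        for i in chosen:
            reward = 1 if rng.random() < arm_means[i] else 0
            alpha[i] += reward
            beta[i] += 1 - reward
            total_reward += reward
    return total_reward, alpha, beta
```

With well-separated arm means, the posteriors concentrate and the top `n_select` arms end up being played in almost every round, which is the behavior the regret analysis quantifies.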
Similar Articles
Analysis of Thompson Sampling for the Multi-armed Bandit Problem
The multi-armed bandit problem is a popular model for studying exploration/exploitation trade-off in sequential decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic idea is to choose an arm to pla...
Stochastic Regret Minimization via Thompson Sampling
The Thompson Sampling (TS) policy is a widely implemented algorithm for the stochastic multi-armed bandit (MAB) problem. Given a prior distribution over possible parameter settings of the underlying reward distributions of the arms, at each time instant, the policy plays an arm with probability equal to the probability that this arm has the largest mean reward conditioned on the current posterior di...
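The probability-matching property stated in this abstract can be checked empirically: the frequency with which TS selects an arm approximates the posterior probability that the arm has the largest mean. A small Monte Carlo sketch, assuming Beta posteriors given as `(alpha, beta)` pairs (the function name is illustrative):

```python
import random


def ts_selection_frequency(posteriors, n_draws=100_000, seed=1):
    """Estimate the probability that Thompson sampling plays each arm,
    given independent Beta(alpha, beta) posteriors per arm.

    Each draw samples one mean per arm and records the argmax, so the
    resulting frequencies approximate the posterior probability that
    each arm has the largest mean reward.
    """
    rng = random.Random(seed)
    k = len(posteriors)
    counts = [0] * k
    for _ in range(n_draws):
        theta = [rng.betavariate(a, b) for a, b in posteriors]
        counts[max(range(k), key=lambda i: theta[i])] += 1
    return [c / n_draws for c in counts]
```

For posteriors that are well separated, the frequency of the better arm should be close to one, matching the probability-matching description in the text.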
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparis...
Analysis of Thompson Sampling for Stochastic Sleeping Bandits
We study a variant of the stochastic multi-armed bandit problem where the set of available arms varies arbitrarily with time (also known as the sleeping bandit problem). We focus on the Thompson Sampling algorithm and consider a regret notion defined with respect to the best available arm. Our main result is an O(log T) regret bound for Thompson Sampling, which generalizes a similar bound known ...
Near-optimal Regret Bounds for Thompson Sampling
Thompson Sampling (TS) is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated that it has favorable empirical performance compared to state-of-the-art methods. In this paper, a novel and almost tight martingale-based regret analysis for Thompson ...